Add io-uring based ObjectStore for local file I/O#21673
Add io-uring based ObjectStore for local file I/O#21673Dandandan wants to merge 9 commits intoapache:mainfrom
Conversation
Introduces `datafusion-object-store-iouring`, a new crate that provides an `IoUringObjectStore` using Linux's io_uring interface for high-performance local file reads. A dedicated thread runs an io_uring event loop, and read requests (`get_opts`, `get_ranges`) are dispatched via channels — enabling batched syscalls where multiple byte-range reads (e.g., Parquet column chunks) are submitted in a single `io_uring_enter()` call instead of individual `pread()` calls. Key design: - Dedicated io_uring worker thread with a 256-entry submission queue - Unbounded mpsc channel for requests, oneshot channels for responses - Range reads batched per-request; chunked if exceeding ring capacity - Write/list/copy/delete operations delegated to LocalFileSystem - On non-Linux platforms, all operations fall back to LocalFileSystem - Feature flag `io-uring` on `datafusion-execution` to opt in Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resolve merge conflicts with upstream (version 53.0.0, object_store 0.13.2) - Update ObjectStore trait impl for 0.13.2 API changes: - copy/copy_if_not_exists → copy_opts(CopyOptions) - delete → delete_stream (required method) - PutMultipartOpts → PutMultipartOptions - Import ObjectStoreExt for head() convenience method - Enable io-uring feature by default in datafusion-execution Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (0bf32dd) to 7bfa3fb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (0bf32dd) to 7bfa3fb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (0bf32dd) to 7bfa3fb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
- Fix missing `StreamExt` import for `.boxed()` on Linux code path - Fix `GetResult.range` type: `Range<u64>` in object_store 0.13.2 - Add `io-uring` feature to datafusion core, forwarding to execution - Add to core's default features so benchmarks get it automatically Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runners and Docker containers often block io_uring_setup via seccomp filters (EPERM). Instead of failing hard, probe availability at construction time and gracefully fall back to LocalFileSystem for all read operations when io_uring cannot be initialized. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (87539f6) to 7bfa3fb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (87539f6) to 7bfa3fb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (87539f6) to 7bfa3fb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
- Remove unnecessary `as u64` cast (meta.size is already u64 in 0.13.2) - Allow clippy::result_large_err on execute_read_ranges (object_store::Error is large by design) - Fix broken rustdoc links to LocalFileSystem (cfg-gated away when io-uring feature is enabled) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DataFusion now includes io_uring-based local file I/O (apache/datafusion#21673), but the default container seccomp profile blocks io_uring_setup. Adds an init container that writes a seccomp profile (default allowlist + io_uring_setup/enter/register) to the node's kubelet seccomp dir. The runner container then references it via Localhost seccomp type. No DaemonSet or cluster-wide changes needed — it's self-contained in each benchmark Job.
DataFusion now includes io_uring-based local file I/O (apache/datafusion#21673), but the default container seccomp profile blocks io_uring_setup. Adds a checked-in seccomp profile (services/seccomp/io-uring-allowed.json) that's the standard containerd default allowlist plus three syscalls: io_uring_setup, io_uring_enter, io_uring_register. Deployment: - A Pulumi DaemonSet (services/seccomp.ts) copies the profile from a ConfigMap to /var/lib/kubelet/seccomp/profiles/ on every node - The controller and benchmark-main workflow reference it via seccompProfile.type: Localhost No init containers needed — the DaemonSet keeps the profile present on all nodes continuously.
DataFusion now includes io_uring-based local file I/O (apache/datafusion#21673), but the default container seccomp profile blocks io_uring_setup. Adds a checked-in seccomp profile (services/seccomp/io-uring-allowed.json) that's the standard containerd default allowlist plus three syscalls: io_uring_setup, io_uring_enter, io_uring_register. Deployment: - A Pulumi DaemonSet (services/seccomp.ts) copies the profile from a ConfigMap to /var/lib/kubelet/seccomp/profiles/ on every node - The controller and benchmark-main workflow reference it via seccompProfile.type: Localhost No init containers needed — the DaemonSet keeps the profile present on all nodes continuously.
DataFusion now includes io_uring-based local file I/O (apache/datafusion#21673), but the default container seccomp profile blocks io_uring_setup. Adds a checked-in seccomp profile (services/seccomp/io-uring-allowed.json) that's the standard containerd default allowlist plus three syscalls: io_uring_setup, io_uring_enter, io_uring_register. Deployment: - A Pulumi DaemonSet (services/seccomp.ts) copies the profile from a ConfigMap to /var/lib/kubelet/seccomp/profiles/ on every node - The controller and benchmark-main workflow reference it via seccompProfile.type: Localhost No init containers needed — the DaemonSet keeps the profile present on all nodes continuously. Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
|
run benchmarks |
1 similar comment
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (5df85eb) to 7bfa3fb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (5df85eb) to 7bfa3fb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (5df85eb) to 7bfa3fb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Mirror the `datafusion-object-store-iouring` crate and registry wiring from apache#21673 into this branch so we can iterate on the io_uring path here.
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (5df85eb) to 7bfa3fb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (5df85eb) to 7bfa3fb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (5df85eb) to 7bfa3fb (merge-base) diff using: tpcds File an issue against this benchmark runner |
Roll back the `uring::is_available()` probe and the `Option<UnboundedSender>` machinery added in 87539f6 so that the store is back to unconditionally using io_uring on Linux. Silently falling back to `LocalFileSystem` when io_uring_setup is blocked (EPERM, seccomp, etc.) hides real configuration problems. Follow-up commits address the correctness and performance gaps in the io_uring path itself.
- Path resolution: `self.root.join(location.as_ref())` joined the raw percent-encoded `object_store::Path` string, which diverges from the `LocalFileSystem` prefix/decoding logic on roots that aren't `/` and on any segment containing characters `object_store` encodes. Route resolution through `LocalFileSystem::path_to_filesystem` so puts and io_uring reads agree. - Partial reads: the drain loop previously `truncate(bytes_read)`'d on the first completion, silently returning short data. Track per-range fill progress and resubmit reads until the full range is covered or the kernel reports an error / unexpected EOF. - CQE bounds: `user_data` is cast to `usize` and used to index the buffer vector. Reject user_data values outside the buffer range instead of panicking or reading past the allocation. - CQE drain wait: if fewer than the expected number of CQEs are available after the initial `submit_and_wait`, wait for the remainder instead of busy-spinning on `completion()`. - `GetOptions` conditionals: `if_match`, `if_none_match`, `if_modified_since`, `if_unmodified_since`, and `version` were silently dropped. Send them to `LocalFileSystem::get_opts` with `head = true` so preconditions are enforced before we issue io_uring reads. - Panics: constructors now return `object_store::Result<Self>` instead of `expect`ing on invalid prefixes or thread-spawn failures; the SQ push retry returns an error on repeated failure rather than `expect`ing. - Misc: reject ranges with `start > end`, reject range lengths that exceed `usize`, and skip io_uring-backed unit tests cleanly when `io_uring_setup` is blocked by the sandbox.
The previous worker popped one `IoCommand` at a time, did
`submit_and_wait(n)` for its ranges, drained, and only then looked at
the next command. Concurrent async callers were effectively serialized
through that single `submit_and_wait`, negating io_uring's main win —
overlapping in-flight reads.
Rework the worker around a slab of `PendingRequest`s keyed by a
`u32` slot id. SQE `user_data` is `(slot << 32) | range_idx`, so every
completion finds its owning request and buffer directly. The loop:
1. Pull new commands until `MAX_IN_FLIGHT_REQUESTS` is reached (or
block on the channel when nothing is in flight).
2. Push queued SQEs into the ring until it is full.
3. `submit_and_wait(1)` when work is outstanding, then drain all
available CQEs. Partial reads re-queue a follow-up SQE.
Slots are only freed once `sqes_outstanding == 0`, so a late CQE can
never scribble into a reallocated slot's buffers.
Also:
* Switch the command channel to a bounded `mpsc::channel(1024)` so
bursts of requests apply backpressure rather than ballooning memory.
* Drop the redundant in-function chunking loop — the slab + backlog
already handles overflow.
* Add a multi-threaded integration test that fires 16 concurrent
`get_ranges` calls to assert the pipelining does not deadlock and
returns correct data.
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (7fdb92b) to 7bfa3fb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (7fdb92b) to 7bfa3fb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (7fdb92b) to 7bfa3fb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Switch the command channel to unbounded so senders never await, and trim hot-path overhead in the worker: - VecDeque backlog avoids the O(N) drain of the front. - Buffers grow via set_len as completions arrive, removing the zero-fill cost of vec![0u8; n] without reading uninit memory. - submit_and_wait(1) alone both flushes SQEs and waits, saving a syscall per loop iteration. - Drop the fixed in-flight cap; backpressure now comes from the ring and CQ drain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (da7abc7) to 7bfa3fb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (da7abc7) to 7bfa3fb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing worktree-io-uring-object-store (da7abc7) to 7bfa3fb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
Local file reads in DataFusion use
object_store::local::LocalFileSystem, which issues onepread()per byte range. For Parquet column chunk reads this means many individual syscalls. Linux's io_uring allows batching these into a singleio_uring_enter().What changes are included in this PR?
New crate
datafusion-object-store-iouringprovidingIoUringObjectStore:get_ranges()submits all byte ranges as SQEs in oneio_uring_enter()LocalFileSystemwhen unavailable (EPERM in Docker/seccomp, non-Linux platforms)io-uringondatafusion-executionanddatafusion, enabled by defaultArchitecture:
To actually enable io_uring in benchmark environments, a companion PR adds a custom seccomp profile: adriangb/datafusion-benchmarking#4.
Are these changes tested?
6 unit tests covering put/get round-trip, single and multi-range reads, head, list, and empty-range edge cases. The io_uring code path requires a Linux host with io_uring support; on macOS CI the fallback path is exercised.
Are there any user-facing changes?
New optional crate and default feature flag. No behavioral changes when io_uring is unavailable — transparent fallback to
LocalFileSystem.